22160 - R for Bio Data Science
The dataset contains 25 features related to chronic kidney disease (CKD), collected from 400 individuals in India. In addition to the CKD diagnosis, it records the co-diagnoses hypertension, diabetes, anemia, pedal edema, and coronary artery disease.
Can we identify any physiological markers which are related to a chronic kidney disease diagnosis? If so, which ones?
Data cleaning and augmentation were done using the Tidyverse collection of packages.
Cleaning: Renaming columns and fixing variable types.
Augmenting: Dividing into age groups, splitting and joining, estimating glomerular filtration rate (GFR).
We ran a correlation analysis and trained a random forest to identify which biomarkers best predict a CKD diagnosis, using the PerformanceAnalytics and randomForest packages.
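A minimal sketch of this kind of analysis, run on simulated stand-in data rather than the CKD dataset. The column names (`serum_creatinine`, `hemoglobin`, `noise`), the rule that generates the `ckd` label, and the choice of mean decrease in Gini as the importance measure are all assumptions made for illustration. Base R `cor()` is used here in place of PerformanceAnalytics to keep the example small.

```r
library(randomForest)

set.seed(1)
n <- 300

# Simulated stand-in data; column names and the labelling rule are hypothetical.
toy <- data.frame(
  serum_creatinine = rlnorm(n, meanlog = 0, sdlog = 0.5),
  hemoglobin       = rnorm(n, mean = 13, sd = 2),
  noise            = rnorm(n)  # unrelated to the label by construction
)
toy$ckd <- factor(
  ifelse(toy$serum_creatinine > 1.2 | toy$hemoglobin < 11.5, "ckd", "notckd")
)

# Rank-based correlation between the numeric markers.
cor_mat <- cor(toy[, c("serum_creatinine", "hemoglobin", "noise")],
               method = "spearman")

# Random forest classifier with variable importance enabled.
rf <- randomForest(ckd ~ ., data = toy, ntree = 500, importance = TRUE)

# type = 2 selects mean decrease in node impurity (Gini).
imp <- importance(rf, type = 2)
ranked <- imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE]
print(ranked)
```

Markers that drive the label should rank above the pure-noise column. On real data, importance scores are best read as a ranking rather than as effect sizes.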
We estimated GFR with the CKD-EPI 2021 creatinine equation below and used the estimate to assign each individual a CKD stage. Because the dataset does not record sex, we averaged the male and female estimates.
\[ \text{eGFR}_{\text{cr}} = 142 \times \min\left(\frac{\text{Scr}}{\kappa},\, 1\right)^{\alpha}\times \max\left(\frac{\text{Scr}}{\kappa},\, 1\right)^{-1.200}\times 0.9938^{\text{Age}}\times 1.012 \;\; \text{[if female]} \]
Here Scr is serum creatinine in mg/dL, \(\kappa\) is 0.7 for females and 0.9 for males, and \(\alpha\) is -0.241 for females and -0.302 for males.
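The calculation can be sketched in base R as follows. The function names are hypothetical, and the constants are those of the CKD-EPI 2021 creatinine equation (κ = 0.7 and α = -0.241 for females, κ = 0.9 and α = -0.302 for males, Scr in mg/dL). The staging thresholds follow the standard KDIGO GFR categories, which is an assumption: the cutoffs actually used in this project are not stated.

```r
# CKD-EPI 2021 creatinine equation; scr in mg/dL, age in years.
egfr_ckd_epi_2021 <- function(scr, age, sex = c("female", "male")) {
  sex <- match.arg(sex)
  kappa      <- if (sex == "female") 0.7 else 0.9
  alpha      <- if (sex == "female") -0.241 else -0.302
  sex_factor <- if (sex == "female") 1.012 else 1
  142 *
    pmin(scr / kappa, 1)^alpha *
    pmax(scr / kappa, 1)^(-1.200) *
    0.9938^age *
    sex_factor
}

# Sex is unknown, so take the mean of the two sex-specific estimates.
egfr_sex_averaged <- function(scr, age) {
  (egfr_ckd_epi_2021(scr, age, "female") +
     egfr_ckd_epi_2021(scr, age, "male")) / 2
}

# Assumed KDIGO GFR categories (mL/min/1.73 m^2), lower bound inclusive.
ckd_stage <- function(egfr) {
  cut(egfr,
      breaks = c(-Inf, 15, 30, 45, 60, 90, Inf),
      labels = c("G5", "G4", "G3b", "G3a", "G2", "G1"),
      right = FALSE)
}

# Example: vectorised over individuals.
egfr <- egfr_sex_averaged(scr = c(0.8, 1.5, 4.0), age = c(35, 60, 72))
data.frame(egfr = round(egfr, 1), stage = ckd_stage(egfr))
```

Averaging the two estimates places each individual between the male and female values, so anyone close to a stage boundary may be assigned to the wrong stage.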
Hypertension and diabetes were present only in individuals with a CKD diagnosis.
              diabetes_mellitus
hypertension   no  yes
         no   220   31
         yes   41  106

        Pearson's Chi-squared test with Yates' continuity correction

data:  tab
X-squared = 144.02, df = 1, p-value < 2.2e-16

[1] "Cramér's V = 0.607"
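A sketch of how output like this can be produced in base R. The table is rebuilt here from the printed counts rather than from the dataset, and the variable names other than `tab` are hypothetical. Cramér's V is computed as sqrt(X² / (n × (min(rows, cols) - 1))). The reported 0.607 appears consistent with the uncorrected chi-squared statistic; using the Yates-corrected statistic instead would give roughly 0.60.

```r
# Rebuild the 2x2 contingency table from the printed counts (filled by column).
tab <- matrix(
  c(220, 41, 31, 106),
  nrow = 2,
  dimnames = list(hypertension = c("no", "yes"),
                  diabetes_mellitus = c("no", "yes"))
)

# chisq.test() applies Yates' continuity correction to 2x2 tables by default.
test_yates <- chisq.test(tab)

# Uncorrected statistic, used here for the effect size.
test_plain <- chisq.test(tab, correct = FALSE)

n <- sum(tab)
cramers_v <- sqrt(unname(test_plain$statistic) / (n * (min(dim(tab)) - 1)))

print(test_yates)
print(paste("Cramér's V =", round(cramers_v, 3)))
```

The four cells sum to 398, not 400, which suggests that rows with a missing value in either variable were dropped before tabulation.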
Findings:
Caveats and possible improvements:
Small dataset
GFR was estimated without information on sex, which reduces its accuracy
More information on the data source needed for more accurate conclusions
Data:
Chronic Kidney Disease dataset. Kaggle.com. Available: https://www.kaggle.com/datasets/mansoordaku/ckdisease/data
Packages:
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L.D., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., et al. (2019). Welcome to the Tidyverse. Journal of Open Source Software 4, 1686. https://doi.org/10.21105/joss.01686.
Peterson, B.G., Carl, P., Boudt, K., Bennett, R., Ulrich, J., Zivot, E., Cornilly, D., Hung, E., Lestel, M., Balkissoon, K., et al. (2024). PerformanceAnalytics: Econometric Tools for Performance and Risk Analysis.
Breiman, L., Cutler, A. (Fortran original), Liaw, A., and Wiener, M. (R port) (2024). randomForest: Breiman and Cutlers Random Forests for Classification and Regression.
Miscellaneous:
CKD-EPI Creatinine Equation (2021) | National Kidney Foundation.
Kaufman, D.P., Basit, H., and Knohl, S.J. (2025). Physiology, Glomerular Filtration Rate. In StatPearls, (Treasure Island (FL): StatPearls Publishing), p.